Abstract

An algebraic integer (AI) based time-multiplexed row-parallel architecture and two final-reconstruction step
(FRS) algorithms are proposed for the implementation of the bivariate AI-encoded 2-D discrete cosine transform (DCT).
The architecture directly realizes an error-free 2-D DCT without using FRSs between the row and column transforms, leading
to an 8×8 2-D DCT that is entirely free of quantization errors in the AI basis. As a result, the user-selectable accuracy
of the FRS allows each of the 64 coefficients to have its precision set independently of
the others, avoiding the leakage of quantization noise between channels that occurs in published DCT designs. The
proposed FRS uses two approaches based on (i) optimized Dempster-Macleod multipliers and (ii) expansion factor
scaling. This architecture enables low-noise high-dynamic-range applications in digital video processing that require
full control of the finite-precision computation of the 2-D DCT. The proposed architectures and FRS techniques are
experimentally verified and validated using hardware implementations that are physically realized on an
FPGA chip. Six designs, for 4- and 8-bit input word sizes, using the two proposed FRS schemes, have been designed,
simulated, physically implemented, and measured. The maximum clock rate and block rate achieved among the 8-bit
input designs are 307.787 MHz and 38.47 MHz, respectively, implying a pixel rate of 8 × 307.787 MHz ≈ 2.462 GHz if
eventually embedded in a real-time video-processing system. The equivalent frame rate is about 1187.35 Hz for an
image size of 1920 × 1080. All implementations are functional on a Xilinx Virtex-6 XC6VLX240T FPGA device.
1 Introduction

High-quality digital video in multimedia devices and video-over-IP networks connected to the Internet is growing
exponentially, and the demand for applications capable of high dynamic range (HDR) video is increasing accordingly.
HDR imaging applications include automatic surveillance [?], geospatial remote sensing [?], traffic cameras [?],
homeland security [?], satellite-based imaging [?], unmanned aerial vehicles [?], the automotive
industry [13], and multimedia wireless sensor networks [14]. Such HDR video systems operating at high resolutions
require associated hardware capable of significant throughput at allowable area-power complexity.

Efficient codec circuits capable of both high-speed operation and high numerical accuracy are needed for
next-generation systems. Such systems may process massive amounts of video feeds, each at high resolution, with minimal
noise and distortion while consuming as little energy as possible [?].

The two-dimensional (2-D) discrete cosine transform (DCT) operation is fundamental to almost all real-time video
compression systems. The circuit realization of the DCT directly relates to the noise, distortion, circuit area, and power
consumption of the related video devices [?]. Usually, the 2-D DCT is computed by successive calls of the
one-dimensional (1-D) DCT applied to the columns of an 8×8 sub-image, and then to the rows of the transposed
intermediate result [?]. The VLSI implementation of trigonometric transforms such as the DCT and DFT is
an active research area [?].

An ideal 8-point 1-D DCT requires multiplications by numbers of the form c[n] = cos(nπ/16), n = 0, 1, ..., 7.
These constants impose computational difficulties in terms of binary number representation because, apart from
c[0] = 1, they are irrational. Usual DCT implementations adopt a compromise solution to this problem, employing
truncation or rounding [34, ?] to approximate such quantities. Thus, instead of the exact value c[n], a quantized
value is used. Clearly, this operation introduces errors.
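
To make the size of this error concrete, the following minimal sketch rounds each constant c[n] to a fixed number of fractional bits and reports the resulting error. The word length `FRAC_BITS = 8` and the helper name `quantize` are illustrative assumptions, not values taken from any of the cited designs.

```python
import math

def quantize(value: float, frac_bits: int) -> float:
    """Round value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

FRAC_BITS = 8  # illustrative word length (an assumption, not from the paper)

# The eight DCT constants c[n] = cos(n*pi/16).
constants = [math.cos(n * math.pi / 16) for n in range(8)]
errors = [abs(c - quantize(c, FRAC_BITS)) for c in constants]

for n, (c, e) in enumerate(zip(constants, errors)):
    print(f"c[{n}] = {c:.12f}   |quantization error| = {e:.3e}")
```

Only c[0] = 1 is reproduced exactly; every other constant carries a nonzero error of at most half a least-significant bit, which is the error that AI encoding is meant to avoid.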

One way of addressing this problem is to employ algebraic integer (AI) encoding [36, 37]. The AI-encoding philosophy
consists of mapping possibly irrational numbers to arrays of integers, which can be arithmetically manipulated
without errors. Also, depending on the numbers to be encoded, this mapping can be exact. For example, all 8-point
DCT multipliers can be given an exact AI representation. Eventually, after the computation is performed, AI-based
algorithms require a final reconstruction step (FRS) in order to map the resulting encoded integer arrays back into the
usual fixed-point representation at a given precision [?].

Besides the numerical representation issues, error propagation also plays a role. In particular, when considering
the fixed-point realization of the multiplication operation, quantization errors are prone to be amplified in the DCT
computation [39, 40]. Quantization noise at a particular 2-D DCT coefficient can have significant correlation with
noise in other coefficients, depending on the statistics of the video signal of interest [31, 33, ?, 40]. Combating noise
injection, noise coupling, and noise amplification is a concern in a practical DCT implementation [31, 33, 35, 39, 40].

In [?], AI-based procedures for the 2-D DCT are proposed. Their architecture was based on the low-complexity
Arai algorithm [43], which formed the building block of each 1-D DCT using AI number representation.
The Arai algorithm is popular in video and image processing applications because of its relatively
low computational complexity: the 8-point Arai algorithm needs only five multiplications to generate
the eight output coefficients. Thus, we naturally choose this low-complexity algorithm as a foundation for proposing
optimized architectures having lower complexity and lower noise. However, that design required the algebraically
encoded numbers to be reconstructed to their fixed-point format at the end of the column-wise DCT calculation by means of
an intermediate reconstruction step. The data are then re-encoded to enter the row-wise DCT calculation block [41, 42].
This approach is not ideal because it introduces both numerical representation errors and error propagation from the
intermediate FRS to subsequent blocks.

We propose a digital hardware architecture for the 8×8 2-D DCT capable of (i) arbitrarily high numeric accuracy
and (ii) high throughput. To achieve these goals, our design keeps the signal flow free of quantization errors in all
its intermediate computational steps by means of a novel doubly AI encoding concept. No intermediate reconstruction
step is introduced, and the entire computation occurs over the AI structure. This prevents error propagation
throughout the intermediate computation, which would otherwise result in error correlation among the final DCT
coefficients. Thus, errors are confined to a single FRS that maps the resulting doubly AI-encoded DCT coefficients
into fixed-point representations [36]. This procedure allows the selection of individual levels of precision for each of
the 64 DCT spectral components at the FRS. At the same time, such flexibility does not affect the noise levels or speed of
other sections of the 2-D DCT.

This work extends the 8-point 1-D AI-based DCT architecture [37, 41, 42] into a fully parallel time-multiplexed
2-D architecture for 8×8 data blocks. The fundamental differences are (i) the absence of any intermediate reconstruction
step; (ii) a new doubly AI encoding scheme; and (iii) the use of a single FRS. The proposed 2-D 8×8
architecture has the following characteristics: (i) independently selectable precision levels for the 2-D DCT coefficients;
(ii) total absence of multiplication operations; and (iii) absence of leakage of quantization noise between coefficient
channels. The proposed architectures perform the FRS operation directly in the bivariate encoded 2-D AI
basis. We introduce designs based on (i) optimized Dempster-Macleod multipliers and (ii) the expansion factor
approach [44]. All hardware implementations are designed to be realized on field-programmable gate arrays (FPGAs)
from Xilinx [45].

This paper unfolds as follows. In Section 2, we review existing designs and the main theoretical points of number
representation based on AI, keeping our focus on the core results needed for our design. Section 3 describes
the proposed circuitry and hardware architecture at block-level detail. In Section 4, strategies for obtaining the
FRS block are proposed and described. Simulation results and actual test measurements are reported in Section 5.
Concluding remarks are drawn in Section 6.


2 Review 

AI encoding was originally proposed for digital signal processing systems by Cozzens and Finkelstein [?].
Since then, it has been adapted for the VLSI implementation of the 1-D DCT and other trigonometric transforms by
Julien et al. in [?]-[?], leading to a 1-D bivariate encoded Arai DCT algorithm by Wahid and Dimitrov [37, 41, 42, ?].
Recently, subsequent contributions by Wahid et al. (using bivariate encoded 1-D Arai DCT blocks for the row and
column transforms of the 2-D DCT) have led to practical area-efficient VLSI video processing circuits with low power
consumption [?]. We now briefly summarize the state of the art in both 1-D and 2-D DCT VLSI cores based on
conventional fixed-point arithmetic as well as on AI encoding.

2.1 Summary and Comparison with Literature 
2.1.1 Fixed-Point DCT VLSI Circuits 

A unified distributed-arithmetic parallel architecture for the computation of the DCT and the DST was proposed in [24]. A
direct-connected 3-D VLSI architecture for the 2-D prime-factor DCT that does not need a transpose memory (buffer)
is available in [25]. A pioneering implementation at a clock of 100 MHz on 0.8 μm CMOS technology for the 2-D
DCT with block size 8×8, which is suitable for HDTV applications, is available in [?].

An efficient VLSI linear array for both the N-point DCT and IDCT using a subband decomposition algorithm that
results in computational and hardware complexity of O(5N/8), with FPGA realization, is reported in [20]. Recently,
VLSI linear-array 2-D architectures and FPGA realizations having computational complexity O(5N/8) (for the forward
DCT) were reported in [?].

An efficient adder-based 2-D DCT core on 0.35 μm CMOS using cyclic convolution is described in [29]. A
high-performance video transform engine employing a space-time scheduling scheme for computing the 2-D DCT in
real time has been proposed and implemented in 0.18 μm CMOS [22]. A systolic-array algorithm using a memory-based
design for both the DCT and the discrete sine transform, which is suitable for real-time VLSI realization, was
proposed in [?]. An FPGA-based system-on-chip realization of the 2-D DCT for 8×8 block size that operates at
107 MHz with a latency of 80 cycles is available in [?]. A low-complexity IP core for the quantized 8×8/4×4 DCT
combined with MPEG-4 codecs and FPGA synthesis is available in [30]. A “new distributed-arithmetic (NEDA)” based
low-power 8×8 2-D DCT is reported in [?]. A reconfigurable processor on TSMC 0.13 μm CMOS technology
operating at 100 MHz is described in [?] for the calculation of the fast Fourier transform and the 2-D DCT. A
high-speed 2-D transform architecture based on the NEDA technique and having a unique kernel for multi-standard video
processing is described in [?].


2.1.2 AI-based DCT VLSI Circuits 

The following AI-based realizations of the 2-D DCT computation rely on the row- and column-wise application of
1-D DCT cores that employ AI quantization [?]. The architectures proposed by Wahid et al. rely on the low-complexity
Arai algorithm and lead to low-power realizations [41, 42, 52-54]. However, these realizations are also
based on the repeated application along rows and columns of a fundamental 1-D DCT building block having an FRS
section at the output stage. Here, 8×8 2-D DCT refers to the use of bivariate encoding in the AI basis and not to
a true AI-based 2-D DCT operation.

A 4×4 approximate 2-D DCT using AI quantization is reported in [?]. Both an FPGA implementation and ASIC
synthesis results on 90 nm CMOS are provided. Although [?] employs AI encoding, it is not an error-free architecture.
The low complexity of this architecture makes it suitable for H.264 realizations.

2.2 Preliminaries for Algebraic Integer Encoding and Decoding

In order to prevent quantization noise, we adopt the AI representation. Such a representation is based on a mapping
function that links input numbers to integer arrays.

This topic is a major and classic field in number theory. A famous exposition is due to Hardy and Wright [57,
Chap. XI and XIV], which is widely regarded as a masterpiece on this subject for its clarity and depth. Pohst also gives
a didactic explanation in [58] with emphasis on computational realization. In [59, p. 79], Pollard and Diamond devote
an entire chapter to the connections between algebraic integers and integral bases. In the following, we furnish an
overview focused on the practical aspects of AI, which may be useful for circuit designers.

Definition 1 A real or complex number is called an algebraic integer if it is a root of a monic polynomial with integer
coefficients [?, 57].

The set of algebraic integers has useful mathematical properties. For instance, algebraic integers form a commutative ring,
which means that addition and multiplication are commutative and that multiplication distributes over addition.
A general AI encoding mapping has the following format:

$$f_{\mathrm{enc}}(x; \mathbf{z}) = \mathbf{a},$$

where $\mathbf{a}$ is a multidimensional array of integers and $\mathbf{z}$ is a fixed multidimensional array of algebraic integers. It can
be shown that there always exist integers such that any real number can be represented with arbitrary precision [46].
Also, there are real numbers that can be represented without error.

The decoding operation is furnished by

$$f_{\mathrm{dec}}(\mathbf{a}; \mathbf{z}) = \mathbf{a} \bullet \mathbf{z},$$

where the binary operation $\bullet$ is the generalized inner product, that is, a component-wise inner product of multidimensional
arrays. The elements of $\mathbf{z}$ constitute the AI basis. In hardware, decoding is often performed by an FRS block, where
the AI basis $\mathbf{z}$ is represented as precisely as required.


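As an example, let the AI basis be $\mathbf{z} = \begin{bmatrix} 1 & z_1 \end{bmatrix}^{\top}$, where $z_1 = \sqrt{2}$ is the algebraic integer and the superscript $\top$
denotes the transposition operation. Thus, a possible AI encoding mapping is $f_{\mathrm{enc}}(x; \mathbf{z}) = \mathbf{a} = \begin{bmatrix} a_0 & a_1 \end{bmatrix}^{\top}$, where $a_0$
and $a_1$ are integers. Encoded numbers are then represented by a 2-point vector of integers. The decoding operation is
simply given by the usual inner product: $x = \mathbf{a} \bullet \mathbf{z} = a_0 + a_1 z_1$. For example, the number $1 - 2\sqrt{2}$ has the following
encoding:

$$f_{\mathrm{enc}}\!\left(1 - 2\sqrt{2};\ \begin{bmatrix} 1 \\ \sqrt{2} \end{bmatrix}\right) = \begin{bmatrix} 1 \\ -2 \end{bmatrix},$$

which is an exact representation.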

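In principle, any number can be represented with arbitrarily high precision. However, within a limited
dynamic range for the employed integers, not all numbers can be exactly encoded. For instance, considering the real
number $\sqrt{3}$, we have $f_{\mathrm{enc}}\!\left(\sqrt{3};\ \begin{bmatrix} 1 & \sqrt{2} \end{bmatrix}^{\top}\right) = \begin{bmatrix} 88 & -61 \end{bmatrix}^{\top}$, where the integers were limited to be 8 bits long. Although
very close, the representation is not exact:

$$f_{\mathrm{dec}}\!\left(\begin{bmatrix} 88 \\ -61 \end{bmatrix};\ \begin{bmatrix} 1 \\ \sqrt{2} \end{bmatrix}\right) - \sqrt{3} \approx 9.21 \times 10^{-4}.$$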
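
The two examples above can be checked numerically. The sketch below decodes integer pairs over the basis [1, √2] and also runs a simple exhaustive search over signed 8-bit pairs; the function names `decode` and `search_encoding` are hypothetical helpers introduced here for illustration, and the search is only one possible way of finding an encoding.

```python
import math

SQRT2 = math.sqrt(2.0)

def decode(a0: int, a1: int) -> float:
    """Reconstruct a0 + a1*sqrt(2) in floating point (the role of an FRS)."""
    return a0 + a1 * SQRT2

def search_encoding(target: float, bits: int = 8):
    """Exhaustive search for the signed `bits`-bit pair (a0, a1) closest to target."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    best_pair, best_err = None, float("inf")
    for a1 in range(lo, hi + 1):
        a0 = round(target - a1 * SQRT2)  # best integer a0 for this a1
        if lo <= a0 <= hi:
            err = abs(decode(a0, a1) - target)
            if err < best_err:
                best_pair, best_err = (a0, a1), err
    return best_pair, best_err

# 1 - 2*sqrt(2) is encoded exactly by the integer pair (1, -2).
exact_err = abs(decode(1, -2) - (1 - 2 * SQRT2))

# sqrt(3) is only approximated by the pair (88, -61) quoted in the text.
paper_err = abs(decode(88, -61) - math.sqrt(3.0))

found_pair, found_err = search_encoding(math.sqrt(3.0))
print(exact_err, paper_err, found_pair, found_err)
```

The pair (88, −61) reproduces the residual of roughly 9.2 × 10⁻⁴ quoted above, while the pair (1, −2) leaves no residual at all.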


In a similar way, the multipliers required by the DCT could be encoded into 2-point integer vectors: $f_{\mathrm{enc}}(c[n]; \mathbf{z}) = \begin{bmatrix} a_0[n] & a_1[n] \end{bmatrix}^{\top}$.
Given that the DCT constants are algebraic integers [38], an exact AI representation can be derived [?].
Thus, the integer sequences $a_0[n]$ and $a_1[n]$ can be easily realized in VLSI hardware.

The multiplication between two numbers represented over an AI basis may be interpreted as a modular polynomial
multiplication with respect to the monic polynomial that defines the AI basis. In the above illustrative
example, consider the multiplication of the pair of numbers $a_0 + a_1 z_1$ and $b_0 + b_1 z_1$, where $b_0$ and $b_1$ are
integers. This operation is equivalent to the computation of the following expression:

$$(a_0 + a_1 x) \cdot (b_0 + b_1 x) \pmod{x^2 - 2}.$$

Thus, existing algorithms for fast polynomial multiplication may be considered [?, p. 311].
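
As a small worked instance of this equivalence, the sketch below multiplies two numbers encoded over [1, √2] using integer operations only, reducing x² to 2. The name `ai_mul` and the operand values are illustrative choices, not taken from the paper.

```python
import math

def ai_mul(a, b):
    """Multiply a0 + a1*x by b0 + b1*x modulo x**2 - 2, using integers only."""
    a0, a1 = a
    b0, b1 = b
    # (a0 + a1 x)(b0 + b1 x) = a0 b0 + (a0 b1 + a1 b0) x + a1 b1 x^2, with x^2 = 2.
    return (a0 * b0 + 2 * a1 * b1, a0 * b1 + a1 * b0)

def decode(a):
    """Floating-point value of the encoded pair, used only to cross-check."""
    return a[0] + a[1] * math.sqrt(2.0)

p = (1, -2)  # encodes 1 - 2*sqrt(2)
q = (3, 1)   # encodes 3 + sqrt(2)
product = ai_mul(p, q)
print(product)  # (-1, -5), i.e. -1 - 5*sqrt(2)
```

No rounding occurs in `ai_mul`; floating point enters only in the cross-check, which mirrors the fact that quantization is deferred to the final reconstruction.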

In practical terms, a good AI representation possesses a basis such that: (i) the required constants can be represented
without error; (ii) the integer elements provided by the representation are sufficiently small to allow a simple
architecture design and fast signal processing; and (iii) the basis itself contains few elements to facilitate simple
encoding-decoding operations.

Other AI procedures allow the constants to be approximated, yielding much better options for encoding, at the cost
of introducing error within the transform (before the FRS) [?].
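Table 1: 2-D AI encoding of Arai DCT constants

$$
\begin{array}{cccc}
c[4] & c[6] & c[2] - c[6] & c[2] + c[6] \\[4pt]
\begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} &
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} &
\begin{bmatrix} 0 & 0 \\ 2 & 0 \end{bmatrix} &
\begin{bmatrix} 0 & 2 \\ 0 & 0 \end{bmatrix}
\end{array}
$$
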


2.3 Bivariate AI Encoding 

Depending on the DCT algorithm employed, only the cosines of a few arcs are in fact required. We adopt the Arai
DCT algorithm [43], and the elements required by this particular 1-D DCT method are only [37, 41, 42]:

$$
\begin{aligned}
c[4] &= \cos\frac{4\pi}{16}, \\
c[6] &= \cos\frac{6\pi}{16}, \\
c[2] - c[6] &= \cos\frac{2\pi}{16} - \cos\frac{6\pi}{16}, \\
c[2] + c[6] &= \cos\frac{2\pi}{16} + \cos\frac{6\pi}{16}.
\end{aligned}
$$

These particular values can be conveniently encoded as follows. Considering $z_1 = \sqrt{2 + \sqrt{2}} + \sqrt{2 - \sqrt{2}}$ and
$z_2 = \sqrt{2 + \sqrt{2}} - \sqrt{2 - \sqrt{2}}$, we adopt the following 2-D array for AI encoding:

$$\mathbf{z} = \begin{bmatrix} 1 & z_1 \\ z_2 & z_1 z_2 \end{bmatrix}.$$

This leads to 2-D encoded coefficients of the form (scaled by 4):

$$f_{\mathrm{enc}}(x; \mathbf{z}) = \mathbf{a} = \begin{bmatrix} a_{0,0} & a_{1,0} \\ a_{0,1} & a_{1,1} \end{bmatrix}.$$


Such an encoding is referred to as bivariate. For this specific AI basis, the required cosine values possess an error-free and
sparse representation, as given in Table 1 [37, 41, 42]. We also note that this representation uses very small integers
and is therefore suitable for fast arithmetic computation. Moreover, the nonzero integers employed are, up to sign, powers
of two, which require no hardware components other than wired shifts and are therefore cost-free.

Encoding an arbitrary real number can be a sophisticated operation requiring the use of look-up tables and greedy
algorithms [63]. Essentially, an exhaustive search is required to obtain the most accurate representation. However,
integer numbers can be encoded effortlessly:

$$f_{\mathrm{enc}}(m; \mathbf{z}) = \begin{bmatrix} m & 0 \\ 0 & 0 \end{bmatrix}, \tag{1}$$

where $m$ is an integer. In this case, the encoding step is unnecessary. Our proposed design takes advantage of this
property.

For a given encoded number $\mathbf{a}$, the decoding operation is simply expressed by:

$$f_{\mathrm{dec}}(\mathbf{a}; \mathbf{z}) = \mathbf{a} \bullet \mathbf{z} = a_{0,0} + a_{1,0} z_1 + a_{0,1} z_2 + a_{1,1} z_1 z_2.$$

Figure 1: 1-D AI Arai DCT block used in Fig. 3 [?].

In terms of circuit design, this operation is usually performed by the FRS.
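
The bivariate decoding rule can be checked against the entries of Table 1. The sketch below assumes the array layout in which entry a_{i,j} multiplies z1^i · z2^j (first row: 1 and z1; second row: z2 and z1·z2) and that each table entry encodes four times the corresponding constant; the helper names are hypothetical.

```python
import math

R_PLUS = math.sqrt(2 + math.sqrt(2))
R_MINUS = math.sqrt(2 - math.sqrt(2))
Z1 = R_PLUS + R_MINUS
Z2 = R_PLUS - R_MINUS

def decode_bivariate(a):
    """a = ((a00, a10), (a01, a11)), matching the basis array ((1, z1), (z2, z1*z2))."""
    (a00, a10), (a01, a11) = a
    return a00 + a10 * Z1 + a01 * Z2 + a11 * Z1 * Z2

def c(n):
    """DCT constant c[n] = cos(n*pi/16)."""
    return math.cos(n * math.pi / 16)

# Table 1 entries paired with the constants they are meant to encode (scaled by 4).
TABLE1 = {
    "c[4]": (((0, 0), (0, 1)), c(4)),
    "c[6]": (((0, 1), (-1, 0)), c(6)),
    "c[2]-c[6]": (((0, 0), (2, 0)), c(2) - c(6)),
    "c[2]+c[6]": (((0, 2), (0, 0)), c(2) + c(6)),
}

errors = {name: abs(decode_bivariate(a) / 4 - value)
          for name, (a, value) in TABLE1.items()}
for name, err in errors.items():
    print(f"{name:10s} residual = {err:.2e}")
```

All four residuals are at the level of floating-point round-off, which is consistent with the encodings being exact.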

In order to simplify the notation, hereafter a superscript notation is used for identifying the
bivariate AI-encoded coefficients. For a given real $x$, we have the following representation:

$$f_{\mathrm{enc}}(x; \mathbf{z}) = \begin{bmatrix} x^{(a)} & x^{(b)} \\ x^{(c)} & x^{(d)} \end{bmatrix} \quad\Longleftrightarrow\quad x = x^{(a)} + x^{(b)} z_1 + x^{(c)} z_2 + x^{(d)} z_1 z_2, \tag{2}$$

where the superscripts $(a)$, $(b)$, $(c)$, and $(d)$ indicate the encoded integers associated with the basis elements $1$, $z_1$, $z_2$, and $z_1 z_2$,
respectively. We denote this basis as $Z_4 = \{1, z_1, z_2, z_1 z_2\}$.

It is worth emphasizing that, in the 2-D AI encoding, the equivalence between algebraic integer multiplication
and polynomial modular multiplication does not hold. Thus, a tailored computational technique to handle this
operation must be developed.



3 2-D AI DCT Architecture 

An 8×8 image block $\mathbf{A}$ has its 2-D DCT mathematically expressed by [?]:

$$\mathbf{X} = \mathbf{C} \cdot \mathbf{A} \cdot \mathbf{C}^{\top}, \tag{3}$$

where $\mathbf{C}$ is the usual DCT matrix [?]. It is straightforward to notice that this operation corresponds to the column-wise
application of the 1-D DCT to the input image $\mathbf{A}$, followed by a transposition, and then the row-wise application
of the 1-D DCT to the resulting matrix.
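
The equivalence between the matrix form and the staged column/transpose/row computation can be illustrated in floating point. The sketch below assumes that the "usual DCT matrix" is the orthonormal DCT-II matrix and uses an arbitrary test block; it says nothing about the AI arithmetic itself, only about the data flow.

```python
import math

N = 8

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix (assumed here to be the 'usual DCT matrix')."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def matmul(p, q):
    """Plain matrix product of two lists of rows."""
    return [[sum(p[i][t] * q[t][j] for t in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

C = dct_matrix()
A = [[(3 * r + 5 * s) % 17 for s in range(N)] for r in range(N)]  # arbitrary block

direct = matmul(matmul(C, A), transpose(C))             # C A C^T
stage1 = matmul(C, A)                                   # column-wise 1-D DCTs
stage2 = matmul(C, transpose(stage1))                   # transpose, then 1-D DCTs again
staged = transpose(stage2)                              # final transposition

max_diff = max(abs(direct[i][j] - staged[i][j]) for i in range(N) for j in range(N))
print(max_diff)
```

The two results agree to round-off, which is why the architecture can reuse the same 1-D DCT block on both sides of a transposition buffer.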

The 2-D DCT realizations in [41, 42, 64, 65] use the AI encoding scheme with decoding sections placed in between
the row- and column-wise 1-D DCT operations. This intermediate reconstruction step leads to the introduction of
quantization noise and cross-coupling of correlated noise components. In contrast, we employ a bivariate AI encoding,
maintaining the computation over AI arithmetic to completely avoid arithmetic errors within the algorithm [?].

The proposed architecture consists of five sub-circuits [?]: (i) an input decimator circuit; (ii) an 8-point AI-encoded
1-D DCT block, shown in Fig. 1, which performs column-wise computation based on the Arai algorithm [43]
and furnishes the intermediate result $\mathbf{C} \cdot \mathbf{A}$ in the AI domain; (iii) an AI-based transposition buffer, shown in Fig. 2, with
a wired cross-connection block for obtaining $(\mathbf{C} \cdot \mathbf{A})^{\top}$; (iv) four parallel instantiations of the same 8-point AI-based
Arai DCT block of Fig. 1 for row-wise computation of eight 1-D DCTs, which results in $\mathbf{C} \cdot (\mathbf{C} \cdot \mathbf{A})^{\top}$; and (v) an
FRS circuit for mapping the AI-encoded 2-D DCT coefficients to 2's complement format. The final transposition is
obtained via wired cross-connections. The proposed architecture is shown in Fig. 3.

Figure 2: 1-D AI transpose buffer used in Fig. 3.

[Figure 3 block labels: decimation block; AI Arai DCT; AI transposition and cross-connection blocks; AI Arai DCT; FRS]

Figure 3: The 2-D AI-DCT consists of an input section having a decimation structure, a 1-D 8-point AI-DCT block
for column-wise DCTs, a real-time AI-TB, four parallel 1-D 8-point AI-DCT blocks for row-wise DCTs, and an FRS [?].

Our implementation covers items (ii)-(v) listed above. We now describe in detail each of the system blocks. 

3.1 Bit Serial Data Input, SerDes, and Decimation 

We assume that the input video data, in raster-scanned format, has already been split into 8×8 pixel blocks. We further
assume that these blocks can be stacked to form an 8-column and (8 × (number of blocks))-row data structure. This
leads to so-called “blocked” video frames, each of size 8×8 pixels. The blocking procedure leads to a raster-scanned
sequence of pixel intensity (or color) values $x_{i,n}$, $i = 0, 1, \ldots, 7$, $n = 0, 1, \ldots, 8 \times (\text{number of blocks}) - 1$, from an 8×8
blocked image. Notice that we use column-row order for the indexes, instead of row-column. Due to the 8×8 size of
the 2-D DCT computation, we find it convenient to consider the time index $n$ after a modular operation, $k = n \pmod 8$.
Hereafter, we refer to the time index as a modular quantity $k = 0, 1, \ldots, 7, 0, 1, \ldots, 7, 0, 1, \ldots, 7, \ldots$

The video signal is serially streamed through the input port of the architecture at a rate of $F_s$. A bit-serial port
connected to a serializer/deserializer (SerDes) is required to be fed at a bit rate of $8 \times F_s$, without considering
overheads. As an aside, we note that this input bit stream may typically be derived from optical fiber transmission or
high-throughput Ethernet ports driven at 9.6 Gbps. Following the SerDes, a decimation block converts the input byte
sequence into a row structure by means of delaying and downsampling by eight, as shown in Fig. 3.

Therefore, the raster-scanned input is decimated in time into eight parallel streams operating at a rate of $F_{\text{clock}} = F_s/8$,
resulting in the eight columns of the input block. It is important to emphasize that such input data consist of integer
values. Thus, they are AI coded without any computation, as shown in (1). The obtained column data are submitted to
the column-wise application of the AI-based 1-D DCT.
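
The delay-and-downsample step can be modeled behaviorally in a few lines. This is a software sketch only, with a hypothetical `decimate` helper and a synthetic sample stream; it does not model the SerDes or any clocking detail.

```python
def decimate(stream, width=8):
    """Split a raster-scanned sample stream into `width` parallel column streams.

    Stream j receives samples j, j + width, j + 2*width, ..., i.e. the input
    delayed by j samples and downsampled by `width`.
    """
    return [stream[j::width] for j in range(width)]

def modular_time_index(n, width=8):
    """Row position inside the current block: k = n mod 8."""
    return n % width

# Two stacked 8x8 blocks: 16 rows of 8 samples, labelled by their arrival order.
stream = list(range(2 * 64))
columns = decimate(stream)

# At slow-clock tick n, the eight streams jointly present one row of the block.
n = 9
word = [col[n] for col in columns]
print(word, "k =", modular_time_index(n))
```

Each parallel stream advances once for every eight input samples, which corresponds to the reduced rate F_clock = F_s/8 described above.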

3.2 An 8-point AI-Encoded Arai DCT Core 

The column-wise transform operation is performed according to the 8-point AI-based Arai DCT hardware cores
designed in [41, 42] and shown in Fig. 1. Here, this scheme is employed with its original FRS removed. The proposed
2-D architecture employs integer arithmetic entirely defined over the AI basis $Z_4$. This transformation step operates
at the reduced clock rate of $F_{\text{clock}}$.

The resulting AI-encoded data components are split into four channels according to their $Z_4$ basis representation [?].
Such outputs are time-multiplexed mixed-domain partially computed spectral components. We denote them
as $X_{i,k}^{(a)}$, $X_{i,k}^{(b)}$, $X_{i,k}^{(c)}$, and $X_{i,k}^{(d)}$, where $i = 0, 1, \ldots, 7$ is the column index and $k$ is the modular time index containing
the information of the row number.

In hardware, this means that the AI representation is contained in at most four parallel integer channels [?]. Some
quantities are known beforehand to require fewer than four AI-encoded integers (cf. (2)). Thus, in some cases, fewer than
four connections are required. These channels are routed to the proposed AI-based transpose buffer (AI-TB) shown in
Fig. 2 as a necessary pre-processing step for the subsequent row-wise DCT calculation.

Figure 4: Row-wise DCT block that leads to the 2-D DCT of the 8×8 input blocks.

3.3 Real-time AI-based Transpose Buffer 

Each partially computed transform component $X_{i,k}^{(q)}$, $q \in \{a, b, c, d\}$, from the column-wise DCT block is represented
in $Z_4$. Such encoded components are stored in the proposed AI-TB (shown in Fig. 2 for a single channel only), which
computes an 8×8 matrix transposition operation in real time every eight clock cycles.

The proposed AI-TB consists of a chain of clocked first-in-first-out (FIFO) buffers for each AI-based channel of
each component of the column-wise transformation [?]. For each parallel integer channel $q$, there are eight FIFO
taps clocked at rate $F_{\text{clock}}$. Therefore, the set of FIFO buffers leads to 22 × 8 = 176 output ports from the FIFO buffer
section.

Hard-wired cross-connections physically realize the transposition required by the next row-wise
DCT section. These physical connections are encapsulated in the cross-connection block in Fig. 3 for brevity. The
AI-TB is clocked at a rate of $F_{\text{clock}}$ and yields a new 8×8 block of transposed data every 64 clock periods of the master
clock $F_s$. Subsequently, the transposed AI-encoded elements are submitted to four 1-D AI DCT cores operating in
parallel.
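
A behavioral model helps to picture what the transpose buffer delivers, independently of how the FIFO taps and cross-connections are wired. The class below is a hypothetical software stand-in for one integer channel: it accepts one 8-sample word per slow-clock tick and releases a transposed 8×8 block after every eighth word.

```python
class TransposeBuffer:
    """Behavioral stand-in for one channel of a real-time 8x8 transpose buffer."""

    def __init__(self, size=8):
        self.size = size
        self.rows = []

    def push(self, word):
        """Clock in one word; return the transposed block when full, else None."""
        if len(word) != self.size:
            raise ValueError("word length must equal the block size")
        self.rows.append(list(word))
        if len(self.rows) < self.size:
            return None
        block, self.rows = self.rows, []
        # Wired cross-connections amount to reading the stored block by columns.
        return [list(col) for col in zip(*block)]

buffer = TransposeBuffer()
outputs = []
for k in range(16):                       # two consecutive 8x8 blocks
    word = [10 * k + i for i in range(8)]  # synthetic integer data
    outputs.append(buffer.push(word))

first_block = outputs[7]
print(first_block[0])  # first column of the first input block
```

In this model a complete transposed block becomes available once per eight slow-clock ticks, that is, once per 64 periods of the master clock.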


3.4 Row-wise DCT Computation

After the cross-connection routing, the output taps from the transposition operation are connected to 32 parallel 8:1
multiplexers. Each multiplexer commutes continuously and routes each partially computed DCT component by cycling
through its 3-bit control codes, such that the $q$ channel inputs of each of the four row-wise AI-based DCT cores are
provided with a new set of valid input vectors at rate $F_{\text{clock}}$.

The cores are set in parallel and are able to compute an 8-point DCT every eight clock cycles of the master clock
signal. This operation performs the required row-wise DCT computation in order to complete the 2-D DCT evaluation,
resulting in a doubly encoded AI representation $X_{i,k}^{(p)(q)}$, $p, q \in \{a, b, c, d\}$. Fig. 4 shows the block described above.

3.5 Final Reconstruction Step

The output channels for the 64 2-D DCT coefficients are passed through the proposed FRS for decoding the AI-encoded
numbers back into their fixed-point binary representation, in 2's complement format. Two different architectures
are proposed for the FRS.


4 Final Reconstruction Step 


The proposed FRS architectures differ from the one in [?] by having individualized circuits to compute each output
value at possibly different precisions.

No FRS circuits are employed in any intermediate 1-D DCT block. This prevents quantization noise cross-coupling
between DCT channels. Any quantization noise is injected only at the final output. Therefore, the noise signals
are uncorrelated, which further allows the noise for each output to be independently adjusted and made as low as
required.


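4.1 FRS Based on the Dempster-Macleod Method


In this method, the doubly encoded elements can be decoded according to:

$$X_{i,k}^{(q)} = X_{i,k}^{(q)(a)} + X_{i,k}^{(q)(b)} z_1 + X_{i,k}^{(q)(c)} z_2 + X_{i,k}^{(q)(d)} z_1 z_2, \qquad q \in \{a, b, c, d\}, \tag{4}$$

which are then submitted to (2). The result is the $k$th row of the final 2-D DCT data $X_{i,k}$, $i = 0, 1, \ldots, 7$.
Therefore, for each $q$, (4) unfolds into a particular mathematical expression as shown below:

$$X_{i,k}^{(a)} = X_{i,k}^{(a)(a)} + X_{i,k}^{(a)(b)} z_1 + X_{i,k}^{(a)(c)} z_2 + X_{i,k}^{(a)(d)} z_1 z_2, \tag{5}$$

$$X_{i,k}^{(b)} z_1 = X_{i,k}^{(b)(a)} z_1 + X_{i,k}^{(b)(b)} z_1^2 + X_{i,k}^{(b)(c)} z_1 z_2 + X_{i,k}^{(b)(d)} z_1^2 z_2, \tag{6}$$

$$X_{i,k}^{(c)} z_2 = X_{i,k}^{(c)(a)} z_2 + X_{i,k}^{(c)(b)} z_1 z_2 + X_{i,k}^{(c)(c)} z_2^2 + X_{i,k}^{(c)(d)} z_1 z_2^2, \tag{7}$$

$$X_{i,k}^{(d)} z_1 z_2 = X_{i,k}^{(d)(a)} z_1 z_2 + X_{i,k}^{(d)(b)} z_1^2 z_2 + X_{i,k}^{(d)(c)} z_1 z_2^2 + X_{i,k}^{(d)(d)} z_1^2 z_2^2. \tag{8}$$

The summation of the above quantities returns $X_{i,k}$ (cf. (2)). Terms depending on $z_1$ and $z_2$ may not be rational numbers.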


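Indeed, they are given by

$$
\begin{aligned}
z_1 &= \sqrt{2 + \sqrt{2}} + \sqrt{2 - \sqrt{2}} = 2.613125929752\ldots \\
z_2 &= \sqrt{2 + \sqrt{2}} - \sqrt{2 - \sqrt{2}} = 1.082392200292\ldots \\
z_1^2 &= 4 + 2\sqrt{2} = 6.828427124746\ldots \\
z_2^2 &= 4 - 2\sqrt{2} = 1.171572875253\ldots \\
z_1 z_2 &= 2\sqrt{2} = 2.82842712474619\ldots \\
z_1 z_2^2 &= 4\sqrt{2 - \sqrt{2}} = 3.061467458920\ldots \\
z_1^2 z_2 &= 4\sqrt{2 + \sqrt{2}} = 7.391036260090\ldots \\
z_1^2 z_2^2 &= 8.
\end{aligned}
\tag{9}
$$

The multiplier $z_1^2 z_2^2 = 8$ is a power of two and can be represented exactly. The remaining constants require a binary
approximation.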

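The closest signed 12-bit approximations can be employed to approximate the numbers listed above. This approach
furnishes the quantities below:

$$
\begin{aligned}
z_1 &\approx \frac{669}{2^{8}} = 2.61328125, &
z_2 &\approx \frac{2217}{2^{11}} = 1.08251953125, \\
z_1^2 &\approx \frac{437}{2^{6}} = 6.828125, &
z_2^2 &\approx \frac{2399}{2^{11}} = 1.17138671875, \\
z_1 z_2 &\approx \frac{181}{2^{6}} = 2.828125, &
z_1 z_2^2 &\approx \frac{3135}{2^{10}} = 3.0615234375, \\
z_1^2 z_2 &\approx \frac{473}{2^{6}} = 7.390625.
\end{aligned}
$$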
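Consequently, the 12-bit approximation expressions related to $X_{i,k}^{(q)}$ are given by:

$$X_{i,k}^{(a)} \approx X_{i,k}^{(a)(a)} + \frac{669}{2^{8}} X_{i,k}^{(a)(b)} + \frac{2217}{2^{11}} X_{i,k}^{(a)(c)} + \frac{181}{2^{6}} X_{i,k}^{(a)(d)}, \tag{10}$$

$$X_{i,k}^{(b)} z_1 \approx \frac{669}{2^{8}} X_{i,k}^{(b)(a)} + \frac{437}{2^{6}} X_{i,k}^{(b)(b)} + \frac{181}{2^{6}} X_{i,k}^{(b)(c)} + \frac{473}{2^{6}} X_{i,k}^{(b)(d)}, \tag{11}$$

$$X_{i,k}^{(c)} z_2 \approx \frac{2217}{2^{11}} X_{i,k}^{(c)(a)} + \frac{181}{2^{6}} X_{i,k}^{(c)(b)} + \frac{2399}{2^{11}} X_{i,k}^{(c)(c)} + \frac{3135}{2^{10}} X_{i,k}^{(c)(d)}, \tag{12}$$

$$X_{i,k}^{(d)} z_1 z_2 \approx \frac{181}{2^{6}} X_{i,k}^{(d)(a)} + \frac{473}{2^{6}} X_{i,k}^{(d)(b)} + \frac{3135}{2^{10}} X_{i,k}^{(d)(c)} + 8 X_{i,k}^{(d)(d)}. \tag{13}$$

Figure 5: Final reconstruction step blocks with multi-level pipelining, for (10) and (11), respectively.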


Finally, considering the above quantities and applying (2), the sought fixed-point representations are fully recovered.
The hardware implementation of the multiplier circuits required by the 12-bit approximations above is accomplished
by using the method of Dempster and Macleod [66, 67]. This method is known to be optimal for constant
integer multiplier circuits.

In this multiplierless method, the minimum number of 2-input adders is used for each constant integer multiplier.
Wired shifts that perform “costless” multiplications by powers of two are used in each constant integer multiplier. Here,
an enhancement to the Dempster-Macleod method is made for the constant integer multiplier circuits: the number of
adder bits is minimized, rather than the number of 2-input adders, yielding a smaller overall design.

Accordingly, the multiplications by non-powers of two shown in expressions (10)-(13) can be algorithmically
implemented as described in Table 2. Figs. 5 and 6 depict the corresponding pipelined implementation. Here, the
various stages of the pipelined FRS architectures are shown by having FIFO registers (consisting of parallel delay
flip-flops (D-FFs)) vertically aligned in the figures. Vertically aligned D-FFs indicate the same computation point in a
pipelined constant coefficient multiplication within the FRS.
Figure 6: Final reconstruction step blocks with multi-level pipelining, for (12) and (13), respectively.


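Table 2: Fast algorithms for required integer multipliers

$$
\begin{array}{rl}
m & \text{Input: } x;\ \text{Output: } y,\ \text{where } y = m \cdot x \\
\hline
669 & v_1 = (1 + 2) \cdot x;\quad v_2 = (1 - 2^{3}) \cdot v_1;\quad y = -v_1 - 2^{5} \cdot v_2 \\
2217 & v_1 = (1 + 2^{4}) \cdot x;\quad v_2 = (1 + 2) \cdot x;\quad v_3 = v_1 + 2^{3} \cdot v_2;\quad y = 2^{7} \cdot v_1 + v_3 \\
181 & v_1 = (1 + 2) \cdot x;\quad v_2 = 2^{3} \cdot x + v_1;\quad y = 2^{6} \cdot v_1 - v_2 \\
3135 & v_1 = (1 + 2) \cdot x;\quad v_2 = (1 - 2^{6}) \cdot x;\quad y = 2^{10} \cdot v_1 - v_2 \\
473 & v_1 = (1 + 2^{2}) \cdot x;\quad v_2 = x - 2^{3} \cdot v_1;\quad y = 2^{9} \cdot x + v_2 \\
437 & v_1 = (1 + 2^{2}) \cdot x;\quad v_2 = 2^{5} \cdot x - v_1;\quad y = v_1 + 2^{4} \cdot v_2 \\
2399 & v_1 = (1 + 2^{2}) \cdot x;\quad v_2 = x + 2^{5} \cdot v_1;\quad y = 2^{9} \cdot v_1 - v_2 \\
8 & y = 2^{3} \cdot x
\end{array}
$$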
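
Shift-and-add decompositions of this kind are easy to validate in software. The sketch below writes one decomposition per multiplier in Table 2 using only shifts, additions, and subtractions, and checks each against ordinary multiplication. The shift amounts are an assumption chosen so that every product is reproduced; for some multipliers (2217 in particular) more than one valid choice with the same structure exists.

```python
def mul669(x):
    v1 = x + (x << 1)          # 3x
    v2 = v1 - (v1 << 3)        # (1 - 2^3) * v1
    return -v1 - (v2 << 5)

def mul2217(x):
    v1 = x + (x << 4)          # 17x
    v2 = x + (x << 1)          # 3x
    v3 = v1 + (v2 << 3)
    return (v1 << 7) + v3

def mul181(x):
    v1 = x + (x << 1)          # 3x
    v2 = (x << 3) + v1
    return (v1 << 6) - v2

def mul3135(x):
    v1 = x + (x << 1)          # 3x
    v2 = x - (x << 6)          # (1 - 2^6) * x
    return (v1 << 10) - v2

def mul473(x):
    v1 = x + (x << 2)          # 5x
    v2 = x - (v1 << 3)
    return (x << 9) + v2

def mul437(x):
    v1 = x + (x << 2)          # 5x
    v2 = (x << 5) - v1
    return v1 + (v2 << 4)

def mul2399(x):
    v1 = x + (x << 2)          # 5x
    v2 = x + (v1 << 5)
    return (v1 << 9) - v2

def mul8(x):
    return x << 3              # pure wired shift

MULTIPLIERS = {669: mul669, 2217: mul2217, 181: mul181, 3135: mul3135,
               473: mul473, 437: mul437, 2399: mul2399, 8: mul8}

all_correct = all(f(x) == m * x for m, f in MULTIPLIERS.items() for x in range(-256, 257))
print(all_correct)
```

Each function uses at most four adders or subtractors, and every multiplication by a power of two is a shift, which in hardware costs only wiring.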
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4.2 FRS BASED ON EXPANSION EACTOR SCALING 


The set of exact values given in (9) suggests further relations among those quantities. Indeed, the following
relations may be established:

$$z_1^2 = 4 + z_1 z_2, \qquad z_2^2 = 4 - z_1 z_2,$$

$$z_1^2 z_2 = 2\,(z_1 + z_2), \qquad z_1 z_2^2 = 2\,(z_1 - z_2),$$

$$z_1^2 z_2^2 = 8.$$
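These relations are easy to confirm in floating point. The sketch below assumes the closed forms z1 = sqrt(2 + sqrt 2) + sqrt(2 - sqrt 2) and z2 = sqrt(2 + sqrt 2) - sqrt(2 - sqrt 2); this is an assumption in that (9) is not reproduced here, although these values agree with the scaled quantities quoted later in this section (z1 ≈ 2.6131, z2 ≈ 1.0824, z1 z2 = 2 sqrt 2).

```python
import math

SQRT2 = math.sqrt(2.0)
# Assumed closed forms for the algebraic integers z1 and z2.
Z1 = math.sqrt(2 + SQRT2) + math.sqrt(2 - SQRT2)
Z2 = math.sqrt(2 + SQRT2) - math.sqrt(2 - SQRT2)

def identity_residuals(z1=Z1, z2=Z2):
    """Return left-hand side minus right-hand side for each relation."""
    return {
        "z1^2 = 4 + z1 z2": z1 ** 2 - (4 + z1 * z2),
        "z2^2 = 4 - z1 z2": z2 ** 2 - (4 - z1 * z2),
        "z1^2 z2 = 2 (z1 + z2)": z1 ** 2 * z2 - 2 * (z1 + z2),
        "z1 z2^2 = 2 (z1 - z2)": z1 * z2 ** 2 - 2 * (z1 - z2),
        "z1^2 z2^2 = 8": (z1 * z2) ** 2 - 8,
    }
```

All residuals vanish to within rounding error, which is what allows every product of basis elements to be folded back onto the basis {1, z1, z2, z1 z2}.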


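These identities indicate that a new design can be derived. In fact, by substituting the above relations into (5)-(8), we
have the following expressions:

$$X_{i,k}^{(a)} = X_{i,k}^{(a)(a)} + X_{i,k}^{(a)(b)} z_1 + X_{i,k}^{(a)(c)} z_2 + X_{i,k}^{(a)(d)} z_1 z_2,$$

$$X_{i,k}^{(b)} z_1 = 4\,X_{i,k}^{(b)(b)} + \left(X_{i,k}^{(b)(a)} + 2\,X_{i,k}^{(b)(d)}\right) z_1 + 2\,X_{i,k}^{(b)(d)} z_2 + \left(X_{i,k}^{(b)(b)} + X_{i,k}^{(b)(c)}\right) z_1 z_2,$$

$$X_{i,k}^{(c)} z_2 = 4\,X_{i,k}^{(c)(c)} + 2\,X_{i,k}^{(c)(d)} z_1 + \left(X_{i,k}^{(c)(a)} - 2\,X_{i,k}^{(c)(d)}\right) z_2 + \left(X_{i,k}^{(c)(b)} - X_{i,k}^{(c)(c)}\right) z_1 z_2,$$

$$X_{i,k}^{(d)} z_1 z_2 = 8\,X_{i,k}^{(d)(d)} + 2\left(X_{i,k}^{(d)(b)} + X_{i,k}^{(d)(c)}\right) z_1 + 2\left(X_{i,k}^{(d)(b)} - X_{i,k}^{(d)(c)}\right) z_2 + X_{i,k}^{(d)(a)} z_1 z_2.$$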

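Notice that the output value $X_{i,k}$ is the summation of the above quantities. Therefore, by grouping the terms on
$\{1, z_1, z_2, z_1 z_2\}$, we can express $X_{i,k}$ by the following summation:

$$X_{i,k} = Y_{i,k}^{(a)} + Y_{i,k}^{(b)} z_1 + Y_{i,k}^{(c)} z_2 + Y_{i,k}^{(d)} z_1 z_2, \tag{14}$$

where

$$Y_{i,k}^{(a)} = X_{i,k}^{(a)(a)} + 4\left(X_{i,k}^{(b)(b)} + X_{i,k}^{(c)(c)}\right) + 8\,X_{i,k}^{(d)(d)}, \tag{15}$$

$$Y_{i,k}^{(b)} = X_{i,k}^{(a)(b)} + X_{i,k}^{(b)(a)} + 2\left(X_{i,k}^{(b)(d)} + X_{i,k}^{(c)(d)} + X_{i,k}^{(d)(b)} + X_{i,k}^{(d)(c)}\right), \tag{16}$$

$$Y_{i,k}^{(c)} = X_{i,k}^{(a)(c)} + X_{i,k}^{(c)(a)} + 2\left(X_{i,k}^{(b)(d)} - X_{i,k}^{(c)(d)} + X_{i,k}^{(d)(b)} - X_{i,k}^{(d)(c)}\right), \tag{17}$$

$$Y_{i,k}^{(d)} = X_{i,k}^{(a)(d)} + X_{i,k}^{(b)(b)} + X_{i,k}^{(b)(c)} - X_{i,k}^{(c)(c)} + X_{i,k}^{(c)(b)} + X_{i,k}^{(d)(a)}. \tag{18}$$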


Quantities $Y_{i,k}^{(q)}$, $q \in \{a, b, c, d\}$, require extremely simple arithmetic to be computed. These operations are represented
by the combinational block in Fig. 7. We now turn to the problem of efficiently evaluating (14), which depends on $z_1$,
$z_2$, and $z_1 z_2$.
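The grouping onto the basis {1, z1, z2, z1 z2} can be sketched in a few lines. The data layout below is an assumption made for illustration: `X[p][q]` holds the integer AI coefficient attached to the product of basis elements p and q, with a, b, c, d standing for 1, z1, z2, z1 z2. The helper `evaluate_direct` exists only to cross-check the grouping in floating point.

```python
import math

BASIS = "abcd"  # a -> 1, b -> z1, c -> z2, d -> z1*z2

def group_terms(X):
    """Collapse the 16 integer coefficients X[p][q] into (Ya, Yb, Yc, Yd)."""
    ya = X["a"]["a"] + 4 * (X["b"]["b"] + X["c"]["c"]) + 8 * X["d"]["d"]
    yb = (X["a"]["b"] + X["b"]["a"]
          + 2 * (X["b"]["d"] + X["c"]["d"] + X["d"]["b"] + X["d"]["c"]))
    yc = (X["a"]["c"] + X["c"]["a"]
          + 2 * (X["b"]["d"] - X["c"]["d"] + X["d"]["b"] - X["d"]["c"]))
    yd = (X["a"]["d"] + X["d"]["a"] + X["b"]["b"] - X["c"]["c"]
          + X["b"]["c"] + X["c"]["b"])
    return ya, yb, yc, yd

def basis_values():
    """Floating-point values of the basis elements (assumed closed forms)."""
    s = math.sqrt(2.0)
    z1 = math.sqrt(2 + s) + math.sqrt(2 - s)
    z2 = math.sqrt(2 + s) - math.sqrt(2 - s)
    return {"a": 1.0, "b": z1, "c": z2, "d": z1 * z2}

def evaluate_direct(X):
    """Reference value: sum of X[p][q] * basis[p] * basis[q]."""
    v = basis_values()
    return sum(X[p][q] * v[p] * v[q] for p in BASIS for q in BASIS)

def evaluate_grouped(X):
    """Value obtained from the grouped quantities of the four-term expansion."""
    v = basis_values()
    ya, yb, yc, yd = group_terms(X)
    return ya + yb * v["b"] + yc * v["c"] + yd * v["d"]
```

Only additions, subtractions, and multiplications by 2, 4, and 8 (wired shifts) appear in `group_terms`, in line with the remark that the Y quantities need very simple arithmetic.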


A possibility is to employ an expansion factor that simultaneously scales the quantities $z_1$, $z_2$, and $z_1 z_2$ into
(approximately) integer values. This would facilitate the usage of integer arithmetic. Such an approach has often been
employed by integer transform designers [68, 69]. A good exposition on this method and related schemes is found in [44, Ch. 5].

In mathematical terms, we have the following problem. Let the quantities $z_1$, $z_2$, and $z_1 z_2$ form a vector
$\boldsymbol{\zeta} = [\, z_1 \;\; z_2 \;\; z_1 z_2 \,]^\top$. An expansion factor [44, p. 274] is the real number $\alpha^* > 1$ that satisfies the
following minimization problem:

$$\alpha^* = \arg\min_{\alpha > 1} \left\| \alpha \cdot \boldsymbol{\zeta} - \operatorname{round}(\alpha \cdot \boldsymbol{\zeta}) \right\|, \tag{19}$$

where $\|\cdot\|$ is a given error measure and $\operatorname{round}(\cdot)$ is the rounding function. We adopt the Euclidean norm as the error
measure. The presence of the rounding function introduces several algebraic difficulties, so a closed-form solution
for (19) is non-trivial. Thus, we may resort to computational search. Clearly, additional restrictions
must be imposed: a limited search space and a given precision for $\alpha$.

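In the range $\alpha \in [1, 256]$ with a precision of $10^{-4}$, we could find the optimal value $\alpha^* = 167.2309$. Thus, we have
the following scaling:

$$\alpha^* \cdot \begin{bmatrix} z_1 \\ z_2 \\ z_1 z_2 \end{bmatrix} = \begin{bmatrix} 436.995521744185\ldots \\ 181.009471802748\ldots \\ 473.0005442986\ldots \end{bmatrix} \approx \begin{bmatrix} 437 \\ 181 \\ 473 \end{bmatrix}.$$

The error norm is approximately $10^{-2}$, which is very low for this type of problem.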
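The computational search can be sketched as a plain grid scan. This is an illustrative brute-force version under stated assumptions (closed forms for z1 and z2 as before, a uniform grid, hypothetical function names), not the authors' search procedure.

```python
import math

_S = math.sqrt(2.0)
Z1 = math.sqrt(2 + _S) + math.sqrt(2 - _S)   # assumed closed form, ~2.6131
Z2 = math.sqrt(2 + _S) - math.sqrt(2 - _S)   # assumed closed form, ~1.0824
ZETA = (Z1, Z2, Z1 * Z2)

def rounding_error(alpha, zeta=ZETA):
    """Euclidean norm of alpha*zeta - round(alpha*zeta)."""
    return math.sqrt(sum((alpha * z - round(alpha * z)) ** 2 for z in zeta))

def search_expansion_factor(lo, hi, step=1e-4, zeta=ZETA):
    """Scan alpha = lo, lo+step, ..., hi and return (best_alpha, best_error)."""
    n = int(round((hi - lo) / step))
    best_alpha, best_err = lo, rounding_error(lo, zeta)
    for i in range(1, n + 1):
        alpha = lo + i * step
        err = rounding_error(alpha, zeta)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err
```

Calling `search_expansion_factor(1.0, 256.0)` covers the full range quoted above at about 2.5 million evaluations; restricting the interval, for example to `(167.0, 167.5)`, isolates the neighbourhood of the reported optimum much faster.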

However, notice that small values of $\alpha$ are desirable, since they scale into small integers, which require
a simple hardware design. An analysis of the sub-optimal solutions for (19) shows that $\alpha' = 4.5961$ furnishes the
following scaling:

$$\alpha' \cdot \begin{bmatrix} z_1 \\ z_2 \\ z_1 z_2 \end{bmatrix} = \begin{bmatrix} 12.01031370924931\ldots \\ 4.97483482672658\ldots \\ 12.99986988195626\ldots \end{bmatrix} \approx \begin{bmatrix} 12 \\ 5 \\ 13 \end{bmatrix}.$$

In this case, the resulting integers are relatively small and the error norm is in the order of $10^{-2}$.

Table 3: Booth encoding of the expansion factors $\alpha$

$\alpha$    Representation
---------   -----------------------------------------------------------
4.5961      $2^{2} + 2^{-1} + 2^{-4} + 2^{-5} + 2^{-9}$
167.2309    $2^{7} + 2^{5} + 2^{3} - 2^{0} + 2^{-2} - 2^{-6} - 2^{-8}$
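Signed power-of-two representations like those of Table 3 can be generated mechanically. The sketch below uses the non-adjacent form (canonical signed-digit recoding), a close relative of Booth recoding that is substituted here for illustration; the number of fractional bits is an assumed parameter, and the result need not coincide with the encoding chosen by the authors.

```python
def signed_power_of_two_terms(value, frac_bits):
    """Non-adjacent-form recoding of a positive constant.

    The constant is first quantized to frac_bits fractional bits.
    Returns a list of (exponent, sign) pairs, highest exponent first,
    such that sum(sign * 2**exponent) equals the quantized constant.
    """
    n = int(round(value * (1 << frac_bits)))
    terms = []
    pos = 0
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)   # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= digit
            terms.append((pos - frac_bits, digit))
        n >>= 1
        pos += 1
    return terms[::-1]

def evaluate_terms(terms):
    """Value represented by a list of (exponent, sign) pairs."""
    return sum(sign * 2.0 ** exp for exp, sign in terms)
```

With 8 fractional bits, `signed_power_of_two_terms(167.2309, 8)` yields the seven terms 2^7 + 2^5 + 2^3 - 2^0 + 2^-2 - 2^-6 - 2^-8, so the multiplication by this constant costs six additions or subtractions plus wired shifts.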

Now we are in a position to address the computation of (14). Considering a given expansion factor $\alpha$, we can write:

$$X_{i,k} \approx \frac{1}{\alpha} \left[ \alpha \cdot Y_{i,k}^{(a)} + m_1 \cdot Y_{i,k}^{(b)} + m_2 \cdot Y_{i,k}^{(c)} + m_3 \cdot Y_{i,k}^{(d)} \right], \tag{20}$$

where $m_1$, $m_2$, and $m_3$ are the integer constants implied by the expansion factor $\alpha$. In particular, these constants are
$\{437, 181, 473\}$, for $\alpha = \alpha^*$, and $\{12, 5, 13\}$, for $\alpha = \alpha'$. Notice that (20) consists of a linear combination.

Because the constants $m_1$, $m_2$, and $m_3$ are integers, the associated multiplications can be efficiently implemented in
hardware. Considering common subexpression elimination (CSE), these multiplications are reduced to additions and shift
operations, requiring a minimal amount of hardware resources. For the set $\{437, 181, 473\}$, we have the following CSE
manipulation:


437 • Ykk^’’'^ -f 181 • + 473 • = 

473 - 

256-I)/). 

This computation requires only eight additions. Analogously, for the set {12,5,13}, CSE yields: 

Yi,k^‘^^+Ykk^‘^\ 


Five additions are necessary. The above calculations are represented by the integer coefficient block in Fig. 7.
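To make the addition count concrete, here is one possible shift-and-add grouping for the set {12, 5, 13} that uses exactly five additions. It is a sketch under the assumption that any grouping with this cost is acceptable; it is not claimed to be the authors' CSE expression, and the function names are hypothetical.

```python
def weighted_sum_12_5_13(yb, yc, yd):
    """Compute 12*yb + 5*yc + 13*yd with five additions and wired shifts."""
    s = yb + yd              # addition 1: yb + yd
    t = s + (s << 1)         # addition 2: 3 (yb + yd)
    u = t + yc               # addition 3: 3 (yb + yd) + yc
    v = yc + yd              # addition 4: yc + yd
    return (u << 2) + v      # addition 5: 12 yb + 12 yd + 4 yc + yc + yd

def frs_output(ya, yb, yc, yd, alpha=4.5961):
    """Approximate coefficient value: ya + (12 yb + 5 yc + 13 yd) / alpha.

    In hardware the division by alpha would be deferred to a later stage;
    it is applied here only so the result can be compared with the exact value.
    """
    return ya + weighted_sum_12_5_13(yb, yc, yd) / alpha
```

For example, `frs_output(1, 2, 3, 4)` is close to 1 + 2 z1 + 3 z2 + 4 z1 z2 ≈ 20.787, with the small discrepancy coming from rounding the scaled constants to 12, 5, and 13.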

The remaining multiplication in (20) is the one by $\alpha$, which can be implemented according to the Booth encoding
representation. Table 3 lists the required Booth encoding for $\alpha^* = 167.2309$ and $\alpha' = 4.5961$.

The global multiplication by $1/\alpha$ is not problematic. Indeed, it can be embedded into subsequent signal processing
stages after the DCT operation. Typically, it is absorbed into the quantizer. This approach has been employed in several
DCT architectures [69]-[71].

Figure 7: Block diagram of the proposed AI decoding based on expansion factors.


Table 4: Success rates (%) of the DCT coefficient computation for various fixed-point bus widths L and tolerance levels

                                                  Percentage tolerance
Design  FRS method                         L   10%      5%       1%       0.1%     0.05%    0.01%    0.005%
------  ---------------------------------  --  -------  -------  -------  -------  -------  -------  -------
1       Dempster-Macleod                   4   99.9672  99.9203  99.6422  96.3563  92.7109  64.8406  42.1719
2       Dempster-Macleod                   8   99.9719  99.9344  99.6047  96.3250  92.7031  64.7313  41.9016
3       Expansion factor, {12,5,13}        4   99.1844  98.2944  91.6822  55.1811  45.0667  30.6922  22.8633
4       Expansion factor, {12,5,13}        8   99.1289  98.2944  91.4978  55.0900  45.0289  30.7122  22.8844
5       Expansion factor, {437,181,473}    4   99.9900  99.9822  99.9178  99.1111  98.2000  91.0667  83.1244
6       Expansion factor, {437,181,473}    8   99.9589  99.9511  99.8733  99.0389  98.1278  90.9867  83.1767

Fig. 7 depicts the full block diagram of the discussed computing scheme. Eight separate instances of this block are
necessary to compute coefficients $X_{i,0}$ to $X_{i,7}$, for each $i$.

5 On-FPGA Test and Measurement 

Six designs were implemented on a Xilinx ML605 evaluation kit, which is populated with a Xilinx Virtex-6
XC6VLX240T device. The designs comprise implementations of the 2-D 8×8 Arai AI DCT architecture
with the two types of FRS described in Section 4, for fixed-point 4- and 8-bit wordlengths. Two versions of the
expansion factor FRS are provided, corresponding to expansion factors $\alpha' = 4.5961$ and $\alpha^* = 167.2309$, resulting in six
designs in total. The proposed designs are listed in Table 4.

The JTAG interface was used to input the test 8×8 2-D DCT arrays to the device from the Matlab workspace.
Then the measured outputs were returned to the Matlab workspace via the same interface. Hardware-computed
coefficients were compared to their numerical evaluation furnished by the Matlab signal processing toolbox.

[Figure 8 panels: slices; frequency (MHz); quiescent power (W); dynamic power (W); total power (W); area × time (slices·µs); area × time² (slices·µs²). Each panel compares Designs 1, 3, and 5.]

Figure 8: Resource utilization, speed of operation, and power consumption of the DCT designs given in Table 4 on
Xilinx Virtex-6 XC6VLX240T FPGA for input fixed-point wordlength L = 4.

Figure 9: Resource utilization, speed of operation, and power consumption of the DCT designs given in Table 4 on
Xilinx Virtex-6 XC6VLX240T FPGA for input fixed-point wordlength L = 8.

Table 5: Frame rates and block rates achieved by the implemented designs for a video of resolution 1920×1080

Design  Freq. (MHz)  Block rate (MHz)  Frame rate (Hz)
------  -----------  ----------------  ---------------
1       130.410      16.30             503.08
2       123.120      15.39             475.00
3       309.855      38.73             1195.37
4       300.391      37.55             1158.95
5       312.402      39.05             1205.25
6       307.787      38.47             1187.35
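The entries of Table 5 follow from the clock frequency alone. The sketch below assumes, as stated in the text, that one 8×8 block is completed every eight clock cycles (eight pixels per cycle) and that a 1920×1080 frame is split into 8×8 blocks; the function name and return layout are hypothetical.

```python
def dct_rates(clock_mhz, width=1920, height=1080, block=8):
    """Return (block_rate_mhz, frame_rate_hz, pixel_rate_ghz) for a given clock."""
    block_rate_mhz = clock_mhz / block                      # one block per 8 cycles
    blocks_per_frame = (width * height) / (block * block)   # 32400 for 1080p
    frame_rate_hz = block_rate_mhz * 1e6 / blocks_per_frame
    pixel_rate_ghz = block * clock_mhz / 1e3                # 8 pixels per cycle
    return block_rate_mhz, frame_rate_hz, pixel_rate_ghz
```

For Design 6, `dct_rates(307.787)` gives a block rate of about 38.47 MHz and a pixel rate of about 2.462 GHz. The frame rate comes out near 1187.4 Hz; the tabulated 1187.35 Hz appears to be computed from the rounded block rate.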


5.1 On-chip Verification using Success Rates 

As a figure of merit, we considered the success rate, defined as the percentage of coefficients which are within the error
limit of ±ε%. For ε ∈ {0.005, 0.01, 0.05, 0.1, 1, 5, 10}, the success rates were measured as given in Table 4. The input
wordlength L was set to 4 or 8 bits. The 8-bit size is the typical video processing configuration. The proposed AI
architectures enjoy overflow-free bit-growth at each stage throughout the AI encoded structure, thereby ensuring that
all sources of error are at the FRS and there only. Results show that the FRS based on the expansion factor approach
for {437,181,473} (Designs 5 and 6) offers a significant improvement in accuracy when compared to the remaining FRS
architectures.
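The success-rate figure of merit can be sketched as follows. The text does not spell out how the percentage error is normalized, so this version assumes a relative error with respect to the reference coefficient and treats a zero reference as a hit only when the measured value is also zero; both choices are assumptions.

```python
def success_rate(measured, reference, tol_percent):
    """Percentage of coefficients whose relative error is within tol_percent."""
    if len(measured) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for m, r in zip(measured, reference):
        if r == 0:
            ok = (m == 0)                                   # assumed convention
        else:
            ok = abs(m - r) / abs(r) * 100.0 <= tol_percent
        hits += 1 if ok else 0
    return 100.0 * hits / len(reference)
```

Sweeping `tol_percent` over the tolerance levels listed above, with hardware outputs as `measured` and a floating-point DCT as `reference`, would produce one row of a table such as Table 4.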


5.2 FPGA Resource Consumption 

The resource consumption of the proposed architectures on the Xilinx Virtex-6 XC6VLX240T device is shown in Fig. 8
for L = 4 bits. Fig. 9 gives analogous information for L = 8 bits. Here, FPGA resources are measured in terms of
slices, slice registers, and slice look-up tables (LUTs). Designs 3 and 4, which use the FRS based on the expansion
factor approach for {12,5,13}, consumed the least resources in the device and have the worst accuracy of the three
FRS variants (Table 4). Moreover, Designs 5 and 6 (FRS based on the expansion factor approach for {437,181,473})
not only possess superior accuracy when compared to Designs 1 and 2 (FRS based on the Dempster-Macleod method),
but also consume fewer hardware resources. Overall, the FRS step of the proposed architectures requires a considerable
amount of area when compared to the AI steps of the architecture.

5.3 Clock Speed, Block Rate, Frame Rate 

Frame rates and block rates achieved by the implemented designs for video at resolution 1920×1080 are shown in
Table 5. The design having the best throughput was Design 5, which operates on 4-bit inputs. In Design 5, the
maximum 8×8 2-D DCT block rate is 39.05 MHz for a 312.402 MHz clock. Assuming an input video resolution of
1920×1080 pixels per frame, we obtained a real-time computation of the 2-D 8×8 DCT at 1205.25 frames per second.
Design 6 covers the common 8-bit input case, where the clock is slightly reduced to 307.787 MHz,
yielding an 8×8 block rate of 38.47 MHz and a frame rate of 1187.35 frames per second for the same image size as
above. In all cases, if the 2-D DCT core is eventually embedded in a real-time video processor, the pixel rate is
eightfold the clock frequency of the DCT core (due to the downsampling by eight in the signal flow graph). For example,
potential pixel rates of ≈2.499 GHz and ≈2.462 GHz may be possible for Designs 5 and 6, respectively.

5.4 Xilinx Power Consumption and Critical Path


The total power consumption of FPGA circuits consists of the sum of dynamic and quiescent power consumptions.
Both estimated dynamic and quiescent power consumptions obtained from the design tools for the Xilinx Virtex-6
XC6VLX240T device are provided in Fig. 8 and Fig. 9.

5.5 Area-Time Complexity Metrics 

Estimates of the VLSI area-time complexity metrics for all designs are given in Fig. 8 (L = 4) and Fig. 9
(L = 8), respectively. In general, the area-time metric measures the complexity of VLSI circuits where chip real estate is
more important than speed, while the area-time² metric is often used for VLSI circuits where speed is of paramount concern. We
provide both metrics to offer a broad overview of the area-time complexity levels present in the proposed architectures
as a function of input size and choice of FRS algorithm.

The architectures are free of general purpose multipliers. 

5.6 Overall Comparison with Existing Architectures 

Fixed-point VLSI implementations that are directly comparable to the proposed architecture are compared in detail
in Table 6. Table 7 provides comparisons to AI-based architectures. For brevity and without loss of generality, we
chose Designs 2 and 6 for the purpose of comparison. These are aimed at 8-bit input signals and are examples of
the Dempster-Macleod and expansion factor FRS algorithms. A synopsis of both fixed-point and AI-based 2-D DCT
circuits under comparison in Tables 6 and 7 was provided in Section 2.

6 Conclusions 

A time-multiplexed systolic-array hardware architecture is proposed for the real-time computation of the bivariate AI
encoded 2-D Arai DCT. The architecture is the first 2-D AI encoded DCT hardware that operates completely in the AI
domain. This not only makes the proposed system completely multiplier-free, but also quantization-free up to the final
output channels.

Our architecture employs a novel AI-TB, which facilitates real-time data transposition. The 2-D separable DCT
operation is entirely performed in the AI domain. Indeed, the architecture does not have intermediate FRS sections
between the column- and row-wise AI-based Arai DCT operations. As a result, quantization noise appears only at
the final output stage of the architecture, namely the single FRS section.

The location of the FRS at the final output stage results in the complete decoupling of quantization noise between 
the 64 parallel coefficient channels of the 2-D DCT. This fact is noteworthy because it enables the independent selection 
of precision for each of the 64 channels without having any effect on the speed, power, complexity, or noise level of 
the remaining channels. 

Two algorithms for the FRS are proposed, numerically optimized, analyzed, hardware implemented, and tested
with the proposed 2-D AI encoded section. The architectures are physically implemented for input precisions of 4
and 8 bits, and fully verified on-chip. Of particular relevance is the commonly required 8-bit realization, which is
operational at a clock frequency of 307.787 MHz on a Xilinx Virtex-6 XC6VLX240T FPGA device (see Design 6).
This implies an 8×8 block rate of 38.47 MHz and a potential pixel rate of ≈2.462 GHz if the proposed 2-D DCT core
is embedded in a real-time video processing system. The frame rate for standard HD video at 1920×1080 resolution
is ≈1187.35 Hz assuming 8-bit input words and a core clock frequency of 307.787 MHz.

Table 6: Comparison of the proposed implementation with published fixed-point implementations

Implementation      Measured  Structure                Multi-   Operating  8×8 block     Pixel         Implementation      Coupled       Independently
                    results                            pliers   frequency  rate          rate          technology          quantization  adjustable
                                                                (MHz)      (×10^6 s^-1)  (×10^6 s^-1)                      noise         precision
------------------  --------  -----------------------  -------  ---------  ------------  ------------  ------------------  ------------  -------------
Lin et al.          No        Single 2-D DCT           1        100        1.5625        100           0.13 µm CMOS        Yes           No
Shams et al.        No        Two 1-D DCT + TMEM†      0        N/A        N/A           N/A           N/A                 Yes           No
Madisetti et al.    No        Single 1-D DCT + TMEM†   7        100        1.562         100           0.8 µm CMOS         Yes           No
Guo et al.          No        Single 1-D DCT + TMEM†   0        110        3.4375        220           0.35 µm CMOS        Yes           No
Tumeo et al.        No        Single 1-D DCT + TMEM†   4        107        1.3375        85.6          Xilinx XC2VP30      Yes           No
Sun et al.          No        Two 1-D DCT + TMEM†      0        149        2.328         149           Xilinx XC2VP30      Yes           No
Chen et al.         No        Single 1-D DCT + TMEM†   0        167        2.609         167           0.18 µm CMOS        Yes           No
Proposed, Design 2  Yes       See Fig. 3               0        123.12     15.39*        984.96‡       Xilinx XC6VLX240T   No            Yes
Proposed, Design 6  Yes       See Fig. 3               0        307.79     38.47*        2462.32‡      Xilinx XC6VLX240T   No            Yes

† Row-column transpose buffer. * Block rate = $F_{\mathrm{clock}}/8$. ‡ Pixel rate = $8 \times F_{\mathrm{clock}}$.

Table 7: Comparison of the proposed implementation with published algebraic integer implementations

Implementation      Measured  Structure                     Multi-   Exact 2-D    Operating  8×8 block     Pixel         Implementation      Coupled       Independently  FRS between
                    results                                 pliers   AI compu-    frequency  rate          rate          technology          quantization  adjustable     row-column
                                                                     tation       (MHz)      (×10^6 s^-1)  (×10^6 s^-1)                      noise         precision      stages
------------------  --------  ----------------------------  -------  -----------  ---------  ------------  ------------  ------------------  ------------  -------------  -----------
Nandi et al.        No        Single 1-D DCT + Mem. bank    0        No           N/A        7.8125        125           Xilinx XC5VLX30     Yes           No             No
Jullien et al.      No        Two 1-D DCT + Dual port RAM   0        No           75         1.171         75            0.18 µm CMOS        Yes           No             Yes
Wahid et al.        No        Two 1-D DCT + TMEM†           0        No           194.7      3.042         194.7         0.18 µm CMOS        Yes           No             Yes
Proposed, Design 2  Yes       See Fig. 3                    0        Yes          123.12     15.39*        984.96‡       Xilinx XC6VLX240T   No            Yes            No
Proposed, Design 6  Yes       See Fig. 3                    0        Yes          307.79     38.47*        2462.32‡      Xilinx XC6VLX240T   No            Yes            No

† Row-column transpose buffer. * Block rate = $F_{\mathrm{clock}}/8$. ‡ Pixel rate = $8 \times F_{\mathrm{clock}}$.

The proposed architecture achieves complete elimination of quantization noise coupling between DCT coefficients,
which is present in published 2-D DCT architectures based on both fixed-point arithmetic as well as row-column
8-point Arai DCT cores that have FRS sections between row- and column-wise transforms. The proposed designs allow
each of the 64 coefficients to be computed at its own precision level, where each choice of precision only affects
that particular coefficient. This allows full control of the 2-D DCT computation to any degree of precision desired by
the designer.
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